Exploiting Heterogeneous Annotations for Weibo Word Segmentation and POS Tagging

Authors

Jiayuan Chao

Zhenghua Li

Wenliang Chen

Min Zhang

Abstract

This paper describes our system designed for the NLPCC 2015 shared task on Chinese word segmentation (WS) and POS tagging for Weibo text. We treat WS and POS tagging as two separate tasks and use a cascaded approach. Our major focus is how to effectively exploit multiple heterogeneous datasets to boost the performance of statistical models. This work considers three heterogeneous datasets: Weibo (WB, 10K sentences), Penn Chinese Treebank 7.0 (CTB7, 50K), and People’s Daily (PD, 280K). For WS, we adopt the recently proposed coupled sequence labeling to combine WB, CTB7, and PD, boosting the F1 score from 93.76% (baseline model trained only on WB) to 95.58% (+1.82%). For POS tagging, we adopt an ensemble approach combining coupled sequence labeling and the guide-feature based method, since the three datasets follow three different annotation standards. First, we convert PD into the annotation style of CTB7 based on coupled sequence labeling, denoted by PD^CTB. Then, we merge CTB7 and PD^CTB to train a POS tagger, denoted by Tag_{CTB7+PD^CTB}, which is further used to produce guide features on WB. Finally, the tagging F1 score is improved from 87.93% to 88.99% (+1.06%).
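To make the guide-feature step in the abstract more concrete, the following is a minimal sketch of the general idea: tags predicted by a source-side tagger (here, one trained on CTB7 plus the converted PD data) are attached as extra features to each Weibo token before the target tagger is trained. The function names, the feature templates, and the example tags are illustrative assumptions and are not taken from the paper.

```python
from typing import Dict, List


def base_features(words: List[str], i: int) -> Dict[str, str]:
    """Simple word-context features for position i (hypothetical templates)."""
    prev_word = words[i - 1] if i > 0 else "<BOS>"
    next_word = words[i + 1] if i + 1 < len(words) else "<EOS>"
    return {"w": words[i], "w-1": prev_word, "w+1": next_word}


def with_guide_features(words: List[str], guide_tags: List[str]) -> List[Dict[str, str]]:
    """Add the source tagger's predicted tag to each token's feature set.

    guide_tags is assumed to be the output of a tagger trained on
    another annotation standard; how it is produced is outside this sketch.
    """
    if len(words) != len(guide_tags):
        raise ValueError("words and guide_tags must have the same length")
    features = []
    for i in range(len(words)):
        feats = base_features(words, i)
        # Guide feature: the tag suggested by the source-side tagger.
        feats["guide"] = guide_tags[i]
        # Conjunction of the guide tag with the current word.
        feats["guide+w"] = guide_tags[i] + "|" + words[i]
        features.append(feats)
    return features


if __name__ == "__main__":
    # Assumed example sentence with CTB-style tags from the source tagger.
    for feats in with_guide_features(["我", "爱", "北京"], ["PN", "VV", "NR"]):
        print(feats)
```

The resulting feature dictionaries could be fed to any feature-based sequence labeler trained on the Weibo annotations; the target model then learns how far to trust the guide tag for each word.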

Similar resources

An improved joint model: POS tagging and dependency parsing

Dependency parsing is a form of syntactic parsing that automatically analyzes the dependency structure of sentences, producing a dependency graph for each input sentence. Part-Of-Speech (POS) tagging is a prerequisite for dependency parsing. Generally, dependency parsers do the POS tagging task along with dependency parsing in a pipeline mode. Unfortunately, in pipel...

Fast Coupled Sequence Labeling on Heterogeneous Annotations via Context-aware Pruning

The recently proposed coupled sequence labeling is shown to be able to effectively exploit multiple labeled data with heterogeneous annotations but suffers from a severe inefficiency problem due to the large bundled tag space (Li et al., 2015). In their case study of part-of-speech (POS) tagging, Li et al. (2015) manually design context-free tag-to-tag mapping rules with a lot of effort to reduce t...

Character-Level Dependency Model for Joint Word Segmentation, POS Tagging, and Dependency Parsing in Chinese

Recent work on joint word segmentation, POS (Part Of Speech) tagging, and dependency parsing in Chinese has two key problems: the first is that word segmentation based on character and dependency parsing based on word were not combined well in the transition-based framework, and the second is that the joint model suffers from the insufficiency of annotated corpus. In order to resolve the first ...

Learning Chinese language structures with multiple views

Motivated by the inadequacy of single view approaches in many areas in NLP, we study multi-view Chinese language processing, including word segmentation, part-of-speech (POS) tagging, syntactic parsing and semantic role labeling (SRL), in this thesis. We consider three situations of multiple views in statistical NLP: (1) Heterogeneous computational models have been designed for a given problem;...

A Maximum Entropy Tagger with Unsupervised Hidden Markov Models

We describe a new tagging model where the states of a hidden Markov model (HMM) estimated by unsupervised learning are incorporated as the features in a maximum entropy model. Our method for exploiting unsupervised learning of a probabilistic model can reduce the cost of building taggers with no dictionary and a small annotated corpus. Experimental results on English POS tagging and Japanese wo...

Publication date: 2015
